Objectives of the project:
setwd('/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R/')
getwd()
## [1] "C:/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R"
Load the provided data set AirBnB (1).Rdata and inspect
its contents.
load(file='C:/Users/Gali/Desktop/DataScienceTechInstitute/17. Big data with R/AirBnB (1).Rdata')
test2 <- L
head(test2)
## id listing_url scrape_id last_scraped
## 1 4867396 https://www.airbnb.com/rooms/4867396 2.01607e+13 2016-07-03
## 2 7704653 https://www.airbnb.com/rooms/7704653 2.01607e+13 2016-07-04
## 3 2725029 https://www.airbnb.com/rooms/2725029 2.01607e+13 2016-07-04
## 4 9337509 https://www.airbnb.com/rooms/9337509 2.01607e+13 2016-07-03
## 5 12928158 https://www.airbnb.com/rooms/12928158 2.01607e+13 2016-07-04
## 6 5589471 https://www.airbnb.com/rooms/5589471 2.01607e+13 2016-07-04
## name
## 1 Appartement 60m2 Rue Legendre 75017
## 2 Appart au pied de l'arc de triomphe
## 3 Nice appartment in Batignolles
## 4 Charming flat near Batignolles
## 5 Spacious bedroom near the centre of Paris
## 6 Rare, Maison individuelle 200m2
## summary
## 1 Au 2ème étage d'un bel immeuble joli 2 pièces meublé comprenant: une grande pièce à vivre lumineuse, une chambre, une cuisine, salle de douche et WC séparé. Appartement très calme et lumineux. A proximité de nombreux commerces et transports.
## 2 Nous proposons cette appartement situé en plein coeur de Paris, au pied de l'arc de triomphe. Commerçants, métro, cinéma, vous trouverez à proximité tout ce qu'il faut pour passer quelques jours à Paris en amoureux, entre copains ou en famille !
## 3 Located in the very charming Batignolles, this cozy and bright two-room appartment will perfectly suit your stay in Paris.
## 4 Welcome to my apartment ! This a quiet and cosy flat with 2 room (25 sqm2) fully furnished closed to trendy Batignolles area in the heart of the 17th district. (Near Montmartre foothill / Place de Clichy).
## 5 Spacious, quiet and bright room, ideal to explore and enjoy
## 6 Maison individuelle, 200 m2 habitable,rénovée en 2013. Quartier résidentiel, nombreux commerces, restaurants. Maison familiale, pouvant accueillir 5 adultes et un enfant (1 lit en hauteur).
## space
## 1
## 2 L'appartement est composé de : - une grande chambre (environ 15m2) avec un lit simple et d'un matelas d'appoint - une salle de bain avec douche, lave linge/sèche linge - un autre chambre (environ 10m2) avec un lit double (lit gigogne) et une salle de bain dans la chambre (douche) - un grand salon avec une cuisine ouverte (environ 35 m2) - wc séparé Le cuisine est tout équipé : machine nespresso, cocotte-minute, mixeur, lave vaisselle... L'appartement est très lumineux puisqu'il donne sur une avenue large mais calme. Vous trouverez à proximité plein de commercants, de bar pour sortir, de restaurants, des cinémas, des musées. Vous serez au coeur de la ville ! N'hésitez pas à nous contacter pour plus d'information, de photos...
## 3
## 4
## 5
## 6
## description
## 1 Au 2ème étage d'un bel immeuble joli 2 pièces meublé comprenant: une grande pièce à vivre lumineuse, une chambre, une cuisine, salle de douche et WC séparé. Appartement très calme et lumineux. A proximité de nombreux commerces et transports.
## 2 Nous proposons cette appartement situé en plein coeur de Paris, au pied de l'arc de triomphe. Commerçants, métro, cinéma, vous trouverez à proximité tout ce qu'il faut pour passer quelques jours à Paris en amoureux, entre copains ou en famille ! L'appartement est composé de : - une grande chambre (environ 15m2) avec un lit simple et d'un matelas d'appoint - une salle de bain avec douche, lave linge/sèche linge - un autre chambre (environ 10m2) avec un lit double (lit gigogne) et une salle de bain dans la chambre (douche) - un grand salon avec une cuisine ouverte (environ 35 m2) - wc séparé Le cuisine est tout équipé : machine nespresso, cocotte-minute, mixeur, lave vaisselle... L'appartement est très lumineux puisqu'il donne sur une avenue large mais calme. Vous trouverez à proximité plein de commercants, de bar pour sortir, de restaurants, des cinémas, des musées. Vous serez au coeur de la ville ! N'hésitez pas à nous contacter pour plus d'information, de photos...
## 3 Located in the very charming Batignolles, this cozy and bright two-room appartment will perfectly suit your stay in Paris.
## 4 Welcome to my apartment ! This a quiet and cosy flat with 2 room (25 sqm2) fully furnished closed to trendy Batignolles area in the heart of the 17th district. (Near Montmartre foothill / Place de Clichy).
## 5 Spacious, quiet and bright room, ideal to explore and enjoy
## 6 Maison individuelle, 200 m2 habitable,rénovée en 2013. Quartier résidentiel, nombreux commerces, restaurants. Maison familiale, pouvant accueillir 5 adultes et un enfant (1 lit en hauteur).
## experiences_offered neighborhood_overview notes transit access interaction
## 1 none
## 2 none
## 3 none
## 4 none
## 5 none
## 6 none
## house_rules
## 1
## 2
## 3
## 4
## 5
## 6
## thumbnail_url
## 1
## 2 https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=small
## 3
## 4
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=small
## 6
## medium_url
## 1
## 2 https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=medium
## 3
## 4
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=medium
## 6
## picture_url
## 1 https://a1.muscache.com/im/pictures/61090424/02c8a8bb_original.jpg?aki_policy=large
## 2 https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=large
## 3 https://a1.muscache.com/im/pictures/96821426/ea9864f1_original.jpg?aki_policy=large
## 4 https://a2.muscache.com/im/pictures/5fa65f2d-b159-4fb5-986a-bd36cb92d2bc.jpg?aki_policy=large
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=large
## 6 https://a2.muscache.com/im/pictures/69589240/79d976c4_original.jpg?aki_policy=large
## xl_picture_url
## 1
## 2 https://a1.muscache.com/im/pictures/97911969/ef37b496_original.jpg?aki_policy=x_large
## 3
## 4
## 5 https://a2.muscache.com/im/pictures/df47511b-0e86-4dcb-9887-569489b16020.jpg?aki_policy=x_large
## 6
## host_id host_url host_name host_since
## 1 9703910 https://www.airbnb.com/users/show/9703910 Matthieu 2013-10-29
## 2 35777602 https://www.airbnb.com/users/show/35777602 Claire 2015-06-14
## 3 13945253 https://www.airbnb.com/users/show/13945253 Vincent 2014-04-06
## 4 5107123 https://www.airbnb.com/users/show/5107123 Julie 2013-02-16
## 5 51195601 https://www.airbnb.com/users/show/51195601 Daniele 2015-12-13
## 6 28980052 https://www.airbnb.com/users/show/28980052 Philippe 2015-03-08
## host_location
## 1 Nantes, Pays de la Loire, France
## 2 Paris, Île-de-France, France
## 3 Paris, Île-de-France, France
## 4 Paris, Île-de-France, France
## 5 Prato, Toscana, Italy
## 6 Paris, Île-de-France, France
## host_about
## 1
## 2
## 3
## 4 Nous sommes un jeune couple vivant à Paris. Nous aimons beaucoup voyager
## 5
## 6
## host_response_time host_response_rate host_acceptance_rate host_is_superhost
## 1 N/A N/A N/A f
## 2 N/A N/A N/A f
## 3 within an hour 100% N/A f
## 4 within a day 50% N/A f
## 5 within an hour 100% 60% f
## 6 N/A N/A N/A f
## host_thumbnail_url
## 1 https://a0.muscache.com/im/users/9703910/profile_pic/1383073563/original.jpg?aki_policy=profile_small
## 2 https://a1.muscache.com/im/users/35777602/profile_pic/1438688930/original.jpg?aki_policy=profile_small
## 3 https://a0.muscache.com/im/users/13945253/profile_pic/1396781528/original.jpg?aki_policy=profile_small
## 4 https://a1.muscache.com/im/users/5107123/profile_pic/1425849895/original.jpg?aki_policy=profile_small
## 5 https://a2.muscache.com/im/pictures/e984ba68-7571-46d9-99dc-735ec6e5c9d6.jpg?aki_policy=profile_small
## 6 https://a0.muscache.com/im/users/28980052/profile_pic/1425844331/original.jpg?aki_policy=profile_small
## host_picture_url
## 1 https://a0.muscache.com/im/users/9703910/profile_pic/1383073563/original.jpg?aki_policy=profile_x_medium
## 2 https://a1.muscache.com/im/users/35777602/profile_pic/1438688930/original.jpg?aki_policy=profile_x_medium
## 3 https://a0.muscache.com/im/users/13945253/profile_pic/1396781528/original.jpg?aki_policy=profile_x_medium
## 4 https://a1.muscache.com/im/users/5107123/profile_pic/1425849895/original.jpg?aki_policy=profile_x_medium
## 5 https://a2.muscache.com/im/pictures/e984ba68-7571-46d9-99dc-735ec6e5c9d6.jpg?aki_policy=profile_x_medium
## 6 https://a0.muscache.com/im/users/28980052/profile_pic/1425844331/original.jpg?aki_policy=profile_x_medium
## host_neighbourhood host_listings_count host_total_listings_count
## 1 Batignolles 1 1
## 2 Champs-Elysées 1 1
## 3 Batignolles 1 1
## 4 Batignolles 1 1
## 5 Ternes 1 1
## 6 Batignolles 1 1
## host_verifications host_has_profile_pic
## 1 ['email', 'phone', 'reviews'] t
## 2 ['email', 'phone', 'reviews'] t
## 3 ['email', 'phone', 'reviews'] t
## 4 ['email', 'phone', 'reviews', 'jumio'] t
## 5 ['email', 'phone', 'reviews', 'jumio'] t
## 6 ['email', 'phone'] t
## host_identity_verified street
## 1 f Rue Legendre, Paris, Île-de-France 75017, France
## 2 f Avenue Mac-Mahon, Paris, Île-de-France 75017, France
## 3 f Rue la Condamine, Paris, Île-de-France 75017, France
## 4 t Rue Gauthey, Paris, Île-de-France 75017, France
## 5 t Avenue Brunetière, Paris, Île-de-France 75017, France
## 6 f Rue de Saussure, Paris, Île-de-France 75017, France
## neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed city
## 1 Batignolles Batignolles-Monceau NA Paris
## 2 Champs-Elysées Batignolles-Monceau NA Paris
## 3 Batignolles Batignolles-Monceau NA Paris
## 4 Batignolles Batignolles-Monceau NA Paris
## 5 Ternes Batignolles-Monceau NA Paris
## 6 Batignolles Batignolles-Monceau NA Paris
## state zipcode market smart_location country_code country latitude
## 1 Île-de-France 75017 Paris Paris, France FR France 48.88880
## 2 Île-de-France 75017 Paris Paris, France FR France 48.87664
## 3 Île-de-France 75017 Paris Paris, France FR France 48.88384
## 4 Île-de-France 75017 Paris Paris, France FR France 48.89236
## 5 Île-de-France 75017 Paris Paris, France FR France 48.88942
## 6 Île-de-France 75017 Paris Paris, France FR France 48.88707
## longitude is_location_exact property_type room_type accommodates
## 1 2.320466 t Apartment Entire home/apt 2
## 2 2.293724 t Apartment Entire home/apt 4
## 3 2.321031 t Apartment Entire home/apt 2
## 4 2.322338 t Apartment Entire home/apt 2
## 5 2.298321 t Apartment Private room 2
## 6 2.312212 t House Entire home/apt 6
## bathrooms bedrooms beds bed_type
## 1 1 1 1 Real Bed
## 2 2 2 3 Real Bed
## 3 1 1 1 Real Bed
## 4 1 1 1 Real Bed
## 5 1 1 1 Real Bed
## 6 3 4 4 Real Bed
## amenities
## 1 {TV,"Cable TV",Internet,"Wireless Internet",Kitchen,Heating,Washer,Dryer,Essentials}
## 2 {"Wireless Internet",Kitchen,"Elevator in Building","Buzzer/Wireless Intercom",Washer,Dryer,Essentials}
## 3 {TV,Internet,"Wireless Internet",Kitchen,"Indoor Fireplace",Heating,"Family/Kid Friendly",Washer,Essentials,Shampoo}
## 4 {"Wireless Internet",Kitchen,Heating,Washer,Essentials}
## 5 {"Wireless Internet",Kitchen,"Smoking Allowed","Pets Allowed",Breakfast,"Elevator in Building",Heating,"Family/Kid Friendly",Washer,Dryer,Essentials,Shampoo}
## 6 {TV,Internet,"Wireless Internet",Kitchen,Heating,"Family/Kid Friendly",Washer,Dryer,"Smoke Detector","Fire Extinguisher",Essentials}
## square_feet price weekly_price monthly_price security_deposit cleaning_fee
## 1 NA $60.00 $388.00 $200.00 $20.00
## 2 NA $200.00
## 3 NA $80.00 $501.00 $1,503.00 $501.00
## 4 NA $60.00 $250.00
## 5 NA $50.00
## 6 NA $191.00 $50.00
## guests_included extra_people minimum_nights maximum_nights calendar_updated
## 1 1 $0.00 1 1125 5 months ago
## 2 1 $0.00 1 1125 11 months ago
## 3 1 $0.00 3 1125 today
## 4 0 $0.00 2 1125 8 months ago
## 5 1 $0.00 1 30 4 weeks ago
## 6 1 $0.00 3 1125 5 months ago
## has_availability availability_30 availability_60 availability_90
## 1 NA 0 0 0
## 2 NA 0 0 0
## 3 NA 6 23 23
## 4 NA 29 59 89
## 5 NA 29 59 89
## 6 NA 0 0 0
## availability_365 calendar_last_scraped number_of_reviews first_review
## 1 0 2016-07-03 1 2015-05-19
## 2 0 2016-07-04 0
## 3 298 2016-07-04 1 2015-10-10
## 4 364 2016-07-03 1 2015-12-15
## 5 89 2016-07-04 2 2016-06-17
## 6 0 2016-07-04 0
## last_review review_scores_rating review_scores_accuracy
## 1 2015-05-19 100 10
## 2 NA NA
## 3 2015-10-10 80 NA
## 4 2015-12-15 80 6
## 5 2016-06-17 100 10
## 6 NA NA
## review_scores_cleanliness review_scores_checkin review_scores_communication
## 1 10 10 10
## 2 NA NA NA
## 3 NA NA NA
## 4 10 8 10
## 5 10 10 10
## 6 NA NA NA
## review_scores_location review_scores_value requires_license license
## 1 10 10 f
## 2 NA NA f
## 3 NA NA f
## 4 6 8 f
## 5 10 10 f
## 6 NA NA f
## jurisdiction_names instant_bookable cancellation_policy
## 1 Paris f flexible
## 2 Paris f flexible
## 3 Paris f flexible
## 4 Paris f flexible
## 5 Paris f flexible
## 6 Paris f flexible
## require_guest_profile_picture require_guest_phone_verification
## 1 f f
## 2 f f
## 3 f f
## 4 f f
## 5 f f
## 6 f f
## calculated_host_listings_count reviews_per_month
## 1 1 0.07
## 2 1 NA
## 3 1 0.11
## 4 1 0.15
## 5 1 2.00
## 6 1 NA
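To see which objects an .Rdata file contains without touching the global workspace, one option is to load it into a separate environment. The sketch below is only an illustration: it uses a temporary file and a toy data frame named `L` (mirroring the object name used above), not the actual data set.

```r
# Toy stand-in for the real file: a small data frame saved under the name L
L <- data.frame(id = 1:3, price = c("$60.00", "$200.00", "$80.00"))
tmp_file <- tempfile(fileext = ".Rdata")
save(L, file = tmp_file)
rm(L)

# Load into a dedicated environment so nothing in the workspace is overwritten
airbnb_env <- new.env()
loaded_names <- load(tmp_file, envir = airbnb_env)  # load() returns the object names

print(loaded_names)
str(airbnb_env$L)   # structure of the loaded data frame
dim(airbnb_env$L)   # number of rows and columns
```

With the real file, `str()` and `dim()` give a more compact overview of the 95 columns than `head()`.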
The following libraries must be installed, as they provide the tools used in this analysis.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.3.2
At this stage, we selected the variables most relevant to the objectives of the project and excluded the free-text description columns. As a result, the number of variables was reduced from 95 to 37.
test2 <- select(test2, host_id, host_name, host_since, host_response_rate, host_acceptance_rate, host_is_superhost, host_total_listings_count, host_identity_verified, neighbourhood_cleansed, latitude, longitude, is_location_exact, property_type, room_type, accommodates, bathrooms, bedrooms, beds, bed_type, price, guests_included, extra_people, minimum_nights, maximum_nights, availability_30, availability_60, availability_90, availability_365, number_of_reviews, first_review, instant_bookable, cancellation_policy, require_guest_profile_picture, require_guest_phone_verification, calculated_host_listings_count, reviews_per_month, review_scores_rating)
dim(test2)
## [1] 52725 37
The following is a summary of our data set. Some variables have missing values, so they need to be transformed before any conclusions can be drawn. We also see zero values for variables that should not have any (bedrooms, bathrooms, etc.), so these need to be investigated and pre-processed too.
summary(test2)
## host_id host_name host_since host_response_rate
## Min. : 2626 Marie : 583 2012-05-04: 166 100% :26619
## 1st Qu.: 6158190 Nicolas : 436 2012-06-18: 165 N/A :12517
## Median :15885410 Pierre : 418 2012-10-25: 155 90% : 2524
## Mean :22485601 Caroline: 388 2014-03-10: 135 80% : 1567
## 3rd Qu.:34348717 Anne : 387 2015-07-29: 128 50% : 949
## Max. :81397049 Sophie : 372 2013-07-20: 116 70% : 676
## (Other) :50141 (Other) :51860 (Other): 7873
## host_acceptance_rate host_is_superhost host_total_listings_count
## 100% :19680 : 46 Min. : 0.00
## N/A :15591 f:50513 1st Qu.: 1.00
## 0% : 1377 t: 2166 Median : 1.00
## 50% : 1292 Mean : 5.83
## 67% : 1149 3rd Qu.: 2.00
## 75% : 915 Max. :1024.00
## (Other):12721 NA's :46
## host_identity_verified neighbourhood_cleansed latitude
## : 46 Buttes-Montmartre : 6025 Min. :48.81
## f:25730 Popincourt : 4883 1st Qu.:48.85
## t:26949 Vaugirard : 3878 Median :48.86
## Batignolles-Monceau: 3603 Mean :48.86
## Entrepôt : 3466 3rd Qu.:48.88
## Passy : 3074 Max. :48.91
## (Other) :27796
## longitude is_location_exact property_type
## Min. :2.221 f: 7369 Apartment :50663
## 1st Qu.:2.323 t:45356 Loft : 567
## Median :2.347 House : 537
## Mean :2.344 Bed & Breakfast: 394
## 3rd Qu.:2.369 Condominium : 266
## Max. :2.475 Other : 122
## (Other) : 176
## room_type accommodates bathrooms bedrooms
## Entire home/apt:45177 Min. : 1.000 Min. :0.00 Min. : 0.000
## Private room : 7001 1st Qu.: 2.000 1st Qu.:1.00 1st Qu.: 1.000
## Shared room : 547 Median : 2.000 Median :1.00 Median : 1.000
## Mean : 3.051 Mean :1.09 Mean : 1.059
## 3rd Qu.: 4.000 3rd Qu.:1.00 3rd Qu.: 1.000
## Max. :16.000 Max. :8.00 Max. :10.000
## NA's :243 NA's :193
## beds bed_type price guests_included
## Min. : 0.000 Airbed : 35 $60.00 : 3055 Min. : 0.000
## 1st Qu.: 1.000 Couch : 1182 $50.00 : 3047 1st Qu.: 1.000
## Median : 1.000 Futon : 449 $70.00 : 2787 Median : 1.000
## Mean : 1.684 Pull-out Sofa: 5066 $80.00 : 2598 Mean : 1.353
## 3rd Qu.: 2.000 Real Bed :45993 $100.00: 2073 3rd Qu.: 2.000
## Max. :16.000 $90.00 : 2031 Max. :16.000
## NA's :80 (Other):37134
## extra_people minimum_nights maximum_nights availability_30
## $0.00 :37324 Min. : 1.000 Min. :1.000e+00 Min. : 0.00
## $10.00 : 4453 1st Qu.: 1.000 1st Qu.:6.000e+01 1st Qu.: 0.00
## $20.00 : 2653 Median : 2.000 Median :1.125e+03 Median : 8.00
## $15.00 : 2469 Mean : 3.128 Mean :1.253e+05 Mean :11.65
## $5.00 : 1179 3rd Qu.: 3.000 3rd Qu.:1.125e+03 3rd Qu.:23.00
## $30.00 : 989 Max. :1000.000 Max. :2.147e+09 Max. :30.00
## (Other): 3658
## availability_60 availability_90 availability_365 number_of_reviews
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 2.00 1st Qu.: 6.00 1st Qu.: 22.0 1st Qu.: 0.00
## Median :26.00 Median :37.00 Median :183.0 Median : 3.00
## Mean :27.33 Mean :41.18 Mean :179.5 Mean : 12.59
## 3rd Qu.:50.00 3rd Qu.:75.00 3rd Qu.:336.0 3rd Qu.: 13.00
## Max. :60.00 Max. :90.00 Max. :365.0 Max. :392.00
##
## first_review instant_bookable cancellation_policy
## :14508 f:44186 flexible :19244
## 2016-05-08: 212 t: 8539 moderate :15039
## 2016-06-13: 193 strict :18427
## 2016-01-03: 186 super_strict_30: 6
## 2016-01-02: 183 super_strict_60: 9
## 2015-09-21: 173
## (Other) :37270
## require_guest_profile_picture require_guest_phone_verification
## f:51816 f:51014
## t: 909 t: 1711
##
##
##
##
##
## calculated_host_listings_count reviews_per_month review_scores_rating
## Min. : 1.000 Min. : 0.010 Min. : 20.00
## 1st Qu.: 1.000 1st Qu.: 0.360 1st Qu.: 87.00
## Median : 1.000 Median : 0.900 Median : 93.00
## Mean : 4.087 Mean : 1.336 Mean : 91.01
## 3rd Qu.: 1.000 3rd Qu.: 1.870 3rd Qu.: 97.00
## Max. :155.000 Max. :14.290 Max. :100.00
## NA's :14508 NA's :15454
Initially, there were 30524 missing values. After recoding empty strings and "N/A" entries as NA, the count rose to 73419.
sum(is.na(test2))
## [1] 30524
test2[test2 == ""] <- NA
test2[test2 == "N/A"] <- NA
sum(is.na(test2))
## [1] 73419
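The jump in the NA count comes from placeholders that R does not treat as missing by default. The following sketch reproduces the effect on a toy data frame; the column values are invented for illustration, and the column-wise `lapply` is one possible way to do the recoding, not necessarily the fastest.

```r
# Toy data frame with factor columns and two kinds of placeholder
toy <- data.frame(
  host_response_rate = factor(c("100%", "N/A", "90%", "")),
  first_review       = factor(c("2015-05-19", "", "2015-10-10", "")),
  beds               = c(1, NA, 2, 1)
)

na_before <- sum(is.na(toy))

# Recode both placeholders as NA, column by column
placeholders <- c("", "N/A")
toy[] <- lapply(toy, function(col) {
  col[col %in% placeholders] <- NA
  col
})

na_after <- sum(is.na(toy))
c(before = na_before, after = na_after)
```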
The following variables have the largest share of missing values (more than 1% of rows), which should be either replaced or omitted.
NAs_qty = colSums(is.na(test2))
NAs_prop = round(NAs_qty/nrow(test2), 3)
NAs.df <- data.frame(NAs_qty, NAs_prop)
NAs.df[NAs.df$NAs_prop > 0.01, ]
## NAs_qty NAs_prop
## host_response_rate 12563 0.238
## host_acceptance_rate 15637 0.297
## first_review 14508 0.275
## reviews_per_month 14508 0.275
## review_scores_rating 15454 0.293
Some of these features need their data type converted before
they can be used properly. In addition, first_review is used to
measure visit frequency, so some of its missing values were
replaced with the value of host_since. We consider this
assumption reasonable for hosts who have just one
apartment.
test2$host_response_rate <- as.numeric(sub("%", " ", test2$host_response_rate))
test2$host_acceptance_rate <- as.numeric(sub("%", " ", test2$host_acceptance_rate))
test2 <- test2 %>% mutate(first_review = case_when (
is.na(first_review) & calculated_host_listings_count == 1 ~ host_since,
.default = first_review ))
test2 = test2 %>%
mutate(first_review = as.Date(paste(first_review,sep='-')))
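The two transformations above can be traced on a few rows. This is a minimal base R sketch on invented values, using character columns instead of factors for simplicity; it mirrors the logic (strip the percent sign, fall back to host_since for single-listing hosts) rather than reproducing the exact code.

```r
toy <- data.frame(
  host_response_rate             = c("100%", NA, "50%"),
  host_since                     = c("2013-10-29", "2015-06-14", "2014-04-06"),
  first_review                   = c("2015-05-19", NA, NA),
  calculated_host_listings_count = c(1, 1, 3),
  stringsAsFactors = FALSE
)

# Strip the percent sign, then convert; NA stays NA
toy$host_response_rate <- as.numeric(sub("%", "", toy$host_response_rate, fixed = TRUE))

# Fall back to host_since only for single-listing hosts with no review date
use_fallback <- is.na(toy$first_review) & toy$calculated_host_listings_count == 1
toy$first_review[use_fallback] <- toy$host_since[use_fallback]
toy$first_review <- as.Date(toy$first_review)

toy
```

The third row keeps its missing first_review because that host has more than one listing.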
The modified features are plotted as histograms below so that we can assess how the data are distributed.
par(mfrow = c(2,2))
hist(test2$host_response_rate,breaks = 30, col="lavender", main = "Host response rate",xlab="host response rate")
hist(test2$host_acceptance_rate,breaks = 30, col="lavender", main = "Acceptance rate",xlab="host acceptance rate")
hist(test2$reviews_per_month,breaks = 30, col="lavender", main = "Reviews per month",xlab="reviews per month")
hist(test2$review_scores_rating ,breaks = 30, col="lavender", main = "Review scores rating",xlab="review scores rating")
These histograms show skewed distributions, so the missing values can be replaced with the median, which is less sensitive to skew and outliers than the mean.
test2$host_response_rate[is.na(test2$host_response_rate)] <- median(test2$host_response_rate, na.rm = TRUE)
test2$host_acceptance_rate[is.na(test2$host_acceptance_rate)] <- median(test2$host_acceptance_rate, na.rm = TRUE)
test2$reviews_per_month[is.na(test2$reviews_per_month)] <- median(test2$reviews_per_month, na.rm = TRUE)
test2$review_scores_rating[is.na(test2$review_scores_rating)] <- median(test2$review_scores_rating, na.rm = TRUE)
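The four imputation lines share one pattern, which can be wrapped in a small helper. The function name `impute_median` and the sample values are assumptions made for this illustration; the sample also shows why the median is the safer fill value when a distribution is skewed.

```r
# Hypothetical helper: replace NA in a numeric vector with the vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Right-skewed toy sample: one large value pulls the mean up, not the median
reviews_per_month <- c(0.1, 0.3, 0.5, 0.9, NA, 14.0)
mean(reviews_per_month, na.rm = TRUE)    # 3.16
median(reviews_per_month, na.rm = TRUE)  # 0.5

imputed <- impute_median(reviews_per_month)
imputed
```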
We now recompute the table of missing values to check whether the remaining incomplete rows can be dropped. As shown below, the largest share of missing values is about 5% (first_review), so these rows can be removed.
NAs_qty = colSums(is.na(test2))
NAs_prop = round(NAs_qty/nrow(test2), 3)
NAs.df <- data.frame(NAs_qty, NAs_prop)
NAs.df[order(-NAs.df$NAs_prop),]
## NAs_qty NAs_prop
## first_review 2766 0.052
## bathrooms 243 0.005
## bedrooms 193 0.004
## beds 80 0.002
## host_name 46 0.001
## host_since 46 0.001
## host_is_superhost 46 0.001
## host_total_listings_count 46 0.001
## host_identity_verified 46 0.001
## host_id 0 0.000
## host_response_rate 0 0.000
## host_acceptance_rate 0 0.000
## neighbourhood_cleansed 0 0.000
## latitude 0 0.000
## longitude 0 0.000
## is_location_exact 0 0.000
## property_type 3 0.000
## room_type 0 0.000
## accommodates 0 0.000
## bed_type 0 0.000
## price 0 0.000
## guests_included 0 0.000
## extra_people 0 0.000
## minimum_nights 0 0.000
## maximum_nights 0 0.000
## availability_30 0 0.000
## availability_60 0 0.000
## availability_90 0 0.000
## availability_365 0 0.000
## number_of_reviews 0 0.000
## instant_bookable 0 0.000
## cancellation_policy 0 0.000
## require_guest_profile_picture 0 0.000
## require_guest_phone_verification 0 0.000
## calculated_host_listings_count 0 0.000
## reviews_per_month 0 0.000
## review_scores_rating 0 0.000
test2 <- na.omit(test2)
The price and extra_people variables, stored as text such as
"$1,503.00", were also converted to numeric values, keeping the digits
before the decimal point.
pattern <- "\\$(\\d+)"
test2$price <- as.numeric(str_match((str_replace_all(test2$price, ",", "")), pattern)[,2])
test2$extra_people <- as.numeric(str_match((str_replace_all(test2$extra_people, ",", "")), pattern)[,2])
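The pattern above keeps only the digits before the decimal point. The sketch below compares that behaviour with an alternative that keeps the full amount; the sample strings are invented, and the alternative is offered as an option rather than as the method used in this report.

```r
price_raw <- c("$60.00", "$1,503.00", "$99.50", "$0.00")

# Same idea as above: remove thousands separators, keep digits before the point
integer_part <- as.numeric(sub("^\\$(\\d+).*$", "\\1", gsub(",", "", price_raw)))

# Alternative: drop "$" and "," and keep the decimals
full_amount <- as.numeric(gsub("[$,]", "", price_raw))

data.frame(price_raw, integer_part, full_amount)
```

The two versions differ only where a price has non-zero cents, as in the third row.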
As mentioned earlier, some variables contain zero values that are not plausible and should be replaced. The following steps count the zeros in each variable for this purpose.
zero_qty = colSums(test2[,1:37]==0)
zero_prop = round(zero_qty/nrow(test2), 3)
zero.df <- data.frame(zero_qty, zero_prop)
zero.df <- arrange(zero.df,desc(zero_qty))
zero.df[zero.df$zero_qty != 0, ]
## zero_qty zero_prop
## extra_people 34784 0.702
## availability_30 14080 0.284
## number_of_reviews 11659 0.235
## availability_60 11569 0.234
## availability_90 10445 0.211
## bedrooms 10030 0.203
## availability_365 8255 0.167
## guests_included 2099 0.042
## host_acceptance_rate 1323 0.027
## bathrooms 82 0.002
## host_total_listings_count 5 0.000
## price 2 0.000
## beds 1 0.000
The same replacement procedure is applied below, with the median substituted for the zero values.
par(mfrow = c(1,2))
hist(test2$number_of_reviews ,breaks = 30, col="lavender", main = "Number of reviews",xlab="number of reviews")
hist(test2$host_acceptance_rate ,breaks = 30, col="lavender", main = "Host acceptance rate",xlab="host acceptance rate")
test2$number_of_reviews[test2$number_of_reviews == 0] <- median(test2$number_of_reviews, na.rm = TRUE)
test2$host_acceptance_rate[test2$host_acceptance_rate == 0] <- median(test2$host_acceptance_rate, na.rm = TRUE)
The zero values of the bedrooms variable should be processed
differently, so we count the listings with zero bedrooms separately for
each number of beds. As can be seen, most of these listings fall into
three main groups: one bed, two beds, and more than two beds.
beds_zero_val <- select(test2,bedrooms, beds) %>% filter(bedrooms == 0) %>% group_by(beds)%>%
mutate(Counts = n()) %>% summarise(beds_qty = unique(Counts))
beds_zero_val
## # A tibble: 6 × 2
## beds beds_qty
## <int> <int>
## 1 1 8131
## 2 2 1774
## 3 3 108
## 4 4 14
## 5 5 2
## 6 7 1
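The same frequency table can be obtained in base R with `table()`. This is a sketch on a toy data frame with invented values; the column name `beds_qty` is reused from above.

```r
toy <- data.frame(
  bedrooms = c(0, 0, 0, 1, 2, 0),
  beds     = c(1, 1, 2, 1, 3, 3)
)

# Frequency of each beds value among listings with zero bedrooms
zero_bedroom <- toy[toy$bedrooms == 0, ]
beds_counts <- as.data.frame(table(beds = zero_bedroom$beds),
                             responseName = "beds_qty")
beds_counts
```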
We then assumed that the replacement values can be chosen from the relationship between bedrooms and beds.
test2 %>%
ggplot(aes(x=factor(bedrooms), y=beds, fill=factor(bedrooms))) +
geom_boxplot(show.legend = FALSE) + ggtitle("The relationship between beds and bedrooms") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5))
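Based on this box plot, the following rule was implemented: a listing with zero bedrooms and one or two beds receives the median number of bedrooms for that number of beds, and a listing with zero bedrooms and more than two beds receives 3.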
test2 <- test2 %>%
mutate(bedrooms = ifelse(bedrooms == 0 & beds == 1, median(test2$bedrooms[test2$beds == 1], na.rm = TRUE),
ifelse(bedrooms == 0 & beds == 2, median(test2$bedrooms[test2$beds == 2], na.rm = TRUE),ifelse(bedrooms == 0 & beds > 2, 3, test2$bedrooms)
)))
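The nested `ifelse` can be hard to follow, so here is the same rule written step by step in base R on a toy data frame. The values and the column name `bedrooms_fixed` are assumptions for illustration; the medians are computed once, before any value is replaced.

```r
toy <- data.frame(
  bedrooms = c(0, 0, 0, 1, 1, 1, 1, 2),
  beds     = c(1, 2, 4, 1, 1, 2, 2, 2)
)

# Median number of bedrooms for listings with one bed and with two beds
med_beds1 <- median(toy$bedrooms[toy$beds == 1], na.rm = TRUE)
med_beds2 <- median(toy$bedrooms[toy$beds == 2], na.rm = TRUE)

fixed <- toy$bedrooms
zero  <- toy$bedrooms == 0
fixed[zero & toy$beds == 1] <- med_beds1
fixed[zero & toy$beds == 2] <- med_beds2
fixed[zero & toy$beds > 2]  <- 3   # fixed value for larger listings

toy$bedrooms_fixed <- fixed
toy
```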
For the remaining features, rows with zero values were simply filtered out, since they account for less than 1% of the observations.
test2 <- test2 %>% filter(bathrooms > 0)
test2 <- test2 %>% filter(host_total_listings_count> 0)
test2 <- test2 %>% filter(price > 0)
test2 <- test2 %>% filter(beds > 0)
As a result, 49431 observations remain, which is only about 6% fewer than the 52725 we had before these manipulations.
dim(test2)
## [1] 49431 37
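The share of rows removed by the cleaning steps follows directly from the two `dim()` outputs in this report. A minimal check, with the two row counts copied in as constants:

```r
rows_before <- 52725   # rows after the column selection
rows_after  <- 49431   # rows after cleaning

rows_removed  <- rows_before - rows_after
share_removed <- round(100 * rows_removed / rows_before, 1)

c(removed = rows_removed, percent = share_removed)
```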
1) Visit frequency of the different quarters according to time
All neighborhoods show very similar distributions of visit frequency over time (approximated here by the date of the first review). This suggests that location has little influence on tourists' choices.
for (i in unique(test2$neighbourhood_cleansed)){
hist(test2$first_review[test2$neighbourhood_cleansed == i],breaks = "month", col="lavender", main = i,xlab="Date")
}
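The histograms above bin the first-review dates by month for each neighborhood. The same counts can be tabulated directly, which makes it easier to compare quarters numerically. This is a sketch on four invented rows, not on the real data.

```r
first_review  <- as.Date(c("2015-05-19", "2015-05-30", "2015-10-10", "2016-06-17"))
neighbourhood <- c("Passy", "Passy", "Louvre", "Passy")

# Collapse each date to its year and month, then cross-tabulate
month   <- format(first_review, "%Y-%m")
monthly <- table(neighbourhood, month)
monthly
```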
2) Number of apartments per owner
Using host_id, we calculated the number of apartments per
owner, which gives 44175 individual owners. The top 50 hosts
are shown below as a bar chart. Because the chart is labelled by
host_name, hosts who share the same name are drawn at the same position
on the x-axis, so their bars and labels overlap.
appart_per_owner <- select(test2, host_id, host_name) %>% group_by(host_id) %>% mutate(Counts = n())
appart_per_owner <- appart_per_owner[!duplicated(appart_per_owner$host_id),] %>%
arrange(desc(Counts))
appart_per_owner
## # A tibble: 44,175 × 3
## # Groups: host_id [44,175]
## host_id host_name Counts
## <int> <fct> <int>
## 1 2288803 Fabien 76
## 2 3972699 Hanane 63
## 3 3943828 Caroline 60
## 4 12984381 Olivier 51
## 5 11593703 Rudy And Benjamin 47
## 6 7612270 Paul 46
## 7 2667370 Parisian Home 43
## 8 152242 Delphine 41
## 9 13013633 Benjamin 40
## 10 3971743 Diane 40
## # ℹ 44,165 more rows
ggplot(data=appart_per_owner[1:50,], aes(x=(reorder(host_name, -Counts)), y=Counts)) +
geom_bar(stat="identity", color="black", fill="red")+
geom_text(aes(label=Counts), vjust=-0.3, size=2.5) + xlab("owner's name") + ylab("apart_num") +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) + ggtitle("Number of apartments of top 50 hosts") + theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5))
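One way to keep hosts with the same first name apart on the x-axis is to build a label from both the name and the id. The sketch below uses base R and invented host ids; the column name `host_label` is an assumption for illustration.

```r
hosts <- data.frame(
  host_id   = c(11, 22, 33, 11, 22, 11),
  host_name = c("Caroline", "Caroline", "Paul", "Caroline", "Caroline", "Caroline")
)

# One row per host with its number of listings
counts <- aggregate(Counts ~ host_id + host_name,
                    data = transform(hosts, Counts = 1),
                    FUN = sum)
counts <- counts[order(-counts$Counts), ]

# Label that stays unique even when two hosts share a name
counts$host_label <- paste0(counts$host_name, " (", counts$host_id, ")")
counts
```

Mapping `host_label` instead of `host_name` to the x aesthetic would then give each host its own bar.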
3) Relationship between prices and apartment features
To start with, a new feature log_price was added to the
data set, because the price variable needed to be normalized
for the analysis that follows.
test2$log_price <- log(test2$price)
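The effect of the log transform can be seen on a small right-skewed sample. The prices below are invented; the gap between mean and median is used here only as a rough indicator of skew.

```r
price <- c(40, 50, 60, 60, 70, 80, 100, 150, 400, 2000)
log_price <- log(price)

# On the raw scale the mean sits far above the median
c(mean = mean(price), median = median(price))

# On the log scale the two are much closer together
c(mean = mean(log_price), median = median(log_price))

# log() needs positive prices; exp() maps back to the original scale
exp(log_price[1:3])
```

Note that `log()` returns `-Inf` for a price of zero, which is one reason the rows with price equal to 0 have to be filtered out beforehand.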
The box plots below show the relationships between price and the apartment features. In general, the price rises with the number of beds, bedrooms and bathrooms. For bathrooms, however, the increase holds only up to 4.5; beyond that, a downward trend prevails.
ggplot(data = test2) +
geom_boxplot(aes(x=factor(beds),y=log_price, fill=factor(beds))) + xlab("beds") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bed"))
ggplot(data = test2) +
geom_boxplot(aes(x=factor(bedrooms),y=log_price, fill=factor(bedrooms))) + xlab("bedrooms") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bedroom"))
ggplot(data = test2) +
geom_boxplot(aes(x=factor(bathrooms),y=log_price, fill=factor(bathrooms))) + xlab("bathrooms") + ggtitle("The relationship between prices and apartment feature") + theme(
plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) + guides(fill = guide_legend(title = "Bathroom"))
The same relationships were also examined separately for each neighborhood, and the same trend can be observed in all of them.
ggplot(data = test2) +
geom_boxplot(aes(x=factor(beds),y=log_price)) +
facet_wrap(~ neighbourhood_cleansed) +
theme_minimal(base_size=8.5) + xlab("beds")
ggplot(data = test2) +
geom_boxplot(aes(x=factor(bedrooms),y=log_price)) +
facet_wrap(~ neighbourhood_cleansed) +
theme_minimal(base_size=13) + xlab("bedrooms")
ggplot(data = test2) +
geom_boxplot(aes(x=factor(bathrooms),y=log_price)) +
facet_wrap(~ neighbourhood_cleansed)+
theme_minimal(base_size=8) +
theme(axis.text.x = element_text(angle=90, vjust=0.5)) + xlab("bathrooms")
The last diagram highlights the quarters with the most expensive apartments (including Élysée, Passy, Palais-Bourbon and Luxembourg) in relation to the main features.
ggplot(data = test2) +
geom_point(aes(x=bedrooms,y=bathrooms,size=price, col=log_price)) + scale_colour_gradient(low = "white", high = "black") +
facet_wrap(~ neighbourhood_cleansed, nrow=3)
4) Renting price per city quarter (“arrondissements”)
The mean price was calculated for each quarter. In general, the results support the previous statement about the neighborhoods.
neighbor_price <- test2 %>% select(neighbourhood_cleansed, price) %>%
group_by(neighbourhood_cleansed) %>%
mutate(average_price_per_neighb = mean(price)) %>% summarise(unique(average_price_per_neighb))
colnames(neighbor_price) = c("Neighbourhood","Average_price")
neighbor_price[order(-neighbor_price$Average_price),]
## # A tibble: 20 × 2
## Neighbourhood Average_price
## <fct> <dbl>
## 1 Élysée 154.
## 2 Luxembourg 141.
## 3 Louvre 136.
## 4 Palais-Bourbon 134.
## 5 Hôtel-de-Ville 128.
## 6 Passy 118.
## 7 Bourse 117.
## 8 Temple 115.
## 9 Panthéon 112.
## 10 Opéra 96.7
## 11 Vaugirard 89.5
## 12 Batignolles-Monceau 87.6
## 13 Observatoire 81.6
## 14 Entrepôt 81.0
## 15 Popincourt 78.6
## 16 Reuilly 76.4
## 17 Buttes-Montmartre 73.8
## 18 Gobelins 72.5
## 19 Buttes-Chaumont 66.1
## 20 Ménilmontant 65.9
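A grouped mean like the one above can also be computed in a single step with base R's `aggregate()`. The sketch uses three neighborhood names from the table but invented prices.

```r
toy <- data.frame(
  neighbourhood_cleansed = c("Louvre", "Louvre", "Passy", "Passy", "Gobelins"),
  price                  = c(150, 160, 120, 110, 70)
)

# Mean price per neighbourhood, sorted from most to least expensive
avg <- aggregate(price ~ neighbourhood_cleansed, data = toy, FUN = mean)
names(avg) <- c("Neighbourhood", "Average_price")
avg <- avg[order(-avg$Average_price), ]
avg
```

Replacing `mean` with `median` in the call gives a summary that is less affected by a few very expensive listings.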
The following diagrams show how prices vary between neighborhoods, and the most expensive quarters can be easily identified.
ggplot(test2, aes(x=neighbourhood_cleansed, y=log_price, fill=neighbourhood_cleansed)) +
geom_boxplot(show.legend = FALSE) + coord_flip() + xlab("neighborhood")
ggplot(test2, aes(x=neighbourhood_cleansed, y=log_price, fill=neighbourhood_cleansed)) +
geom_violin(trim=FALSE) + xlab("neighbourhood") +
ggtitle("Renting price per city quarter") + theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust = 0.5)) +
theme(axis.text.x = element_text(angle=90, hjust=1)) + geom_boxplot(width=0.1)
ggplot(data = test2, mapping = aes(x = price, y = neighbourhood_cleansed)) +
geom_density_ridges(mapping = aes(fill = neighbourhood_cleansed), bandwidth = 130, alpha = .6, size = 1) +
theme_ridges() +
xlab("Price") +
ylab("") +
ggtitle("Price behavior ") + xlim(-250,2000) + guides(fill = guide_legend(title = "Neighborhood"))
## Warning: Removed 5 rows containing non-finite values (`stat_density_ridges()`).
It was also interesting to show the price range on a map
(using the leaflet package) to observe the differences between
neighborhoods. For this purpose we created a new feature,
price_group. The map suggests that prices rise the
closer a listing is to the city center.
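# Low: price < 50, Moderate: 50 <= price < 100, High: price >= 100
# (with the strict "price > 50" test, listings priced exactly 50 fell into "High")
test2 <- test2 %>% mutate(price_group=ifelse(price < 50, "Low", ifelse(price < 100, "Moderate", "High" )))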
pal <- colorFactor(palette = c("red", "green", "blue"), domain = test2$price_group)
leaflet(data = test2) %>% addProviderTiles(providers$CartoDB.Positron) %>% addCircleMarkers(~longitude, ~latitude, color = ~pal(price_group), weight = 1, radius=1, fillOpacity = 0.1, opacity = 0.1, label = paste("Neighbourhood:", test2$neighbourhood_cleansed)) %>%
addLegend("bottomright", pal = pal, values = ~price_group,
title = "Price groups",
opacity = 1
)
#save(test2,file='Airbnb_cleansed.Rdata')
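When a numeric variable is split into groups, each boundary value has to land in exactly one group. Base R's `cut()` makes the interval boundaries explicit. The sketch below uses invented prices and assumes the thresholds 50 and 100 from the price_group definition.

```r
price <- c(30, 50, 75, 100, 250)

# right = FALSE closes each interval on the left: [0,50), [50,100), [100,Inf)
price_group <- cut(price,
                   breaks = c(0, 50, 100, Inf),
                   labels = c("Low", "Moderate", "High"),
                   right  = FALSE)

data.frame(price, price_group)
```

The result is an ordered set of factor levels (Low, Moderate, High), which also keeps the legend in a sensible order.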